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! Environment ! 


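Open Question:
What can we not do with Deep Learning?

[Figure: agent-environment loop. The environment reaches the agent through sensors (percepts); the agent acts on the environment through effectors (actions).]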
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Formal tasks: Playing board games, 
card games. Solving puzzles, 
mathematical and logic problems. 


Expert tasks: Medical diagnosis, 
engineering, scheduling, computer 
hardware design. 


Mundane tasks: Everyday speech, 
written language, perception, 
walking, object manipulation. 


Human tasks: Awareness of self,
emotion, imagination, morality,
subjective experience, high-level
reasoning, consciousness.
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GPS 


Camera (Visible, Infrared)
Radar
Stereo Camera
Microphone
Networking (Wired, Wireless)
IMU
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Output 
(object identity)

3rd hidden layer
(object parts)

2nd hidden layer
(corners and contours)

1st hidden layer
(edges)

Visible layer
(input pixels)
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Machine Learning 


Knowledge
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Effector 
Image Recognition: If it looks like a duck
Audio Recognition: Quacks like a duck
Activity Recognition: Swims like a duck
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Boston Dynamics 





Effector 
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Open Question: 
How much of this AI stack
can be learned?
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Types of Deep Learning 


* Supervised Learning
* Semi-Supervised Learning
* Unsupervised Learning
* Reinforcement Learning

[Figure labels: learning rate α, inverse temperature β, discount rate γ, observation.]
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DeepTraffic: Deep Reinforcement Learning Competition 


DeepTraffic
Main Page - Leaderboard - About DeepTraffic
Americans spend 8 billion hours stuck in traffic every year.
Deep neural networks can help!

[Screenshot: DeepTraffic web interface with the code box, the buttons Apply Code/Reset Net, Save Code/Net to File, Load Code/Net from File, Run Training and Start Evaluation Run, and the "Value Function Approximating Neural Network" view starting with input(280) fc(50).]
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https://selfdrivingcars.mit.edu/deeptraffic 
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DeepTraffic: Deep Reinforcement Learning Competition 


* Competition: https://selfdrivingcars.mit.edu/deeptraffic
* GitHub: https://github.com/lexfridman/deeptraffic
* Paper on arXiv: https://arxiv.org/abs/1801.02805
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Philosophical Motivation for Reinforcement Learning 


Takeaway from Supervised Learning: 


Neural networks are great at memorization and not (yet) great at
reasoning.


Hope for Reinforcement Learning: 


Brute-force propagation of outcomes to knowledge about states 
and actions. This is a kind of brute-force “reasoning”. 
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Agent and Environment 


* At each step the agent:
    * Executes action
    * Receives observation (new state)
    * Receives reward

* The environment:
    * Receives action
    * Emits observation (new state)
    * Emits reward

[Figure: agent-environment loop with action, observation, and reward arrows.]
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Examples of Reinforcement Learning 


Reinforcement learning is a general-purpose framework for decision-making: 
* An agent operates in an environment (example: Atari Breakout)
* An agent has the capacity to act
* Each action influences the agent's future state
* Success is measured by a reward signal
* Goal is to select actions to maximize future reward
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Examples of Reinforcement Learning 





Cart-Pole Balancing 

* Goal — Balance the pole on top of a moving cart 

* State — angle, angular speed, position, horizontal velocity 
* Actions — horizontal force to the cart 


* Reward — 1 at each time step if the pole is upright 
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Examples of Reinforcement Learning 







* Goal — Eliminate all opponents 
* State — Raw game pixels of the game 
* Actions — Up, Down, Left, Right etc 


* Reward — Positive when eliminating an opponent, 
negative when the agent is eliminated 
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Examples of Reinforcement Learning 
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Bin Packing 


* Goal - Pick a device from a box and put it into a container
* State - Raw pixels of the real world
* Actions - Possible actions of the robot
* Reward - Positive when placing a device successfully, negative otherwise
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Markov Decision Process 





$$s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$$

where $s$ is a state, $a$ is an action, $r$ is a reward, and $s_n$ is the terminal state.
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Major Components of an RL Agent 


An RL agent may include one or more of these components: 
* Policy: agent's behavior function 
* Value function: how good is each state and/or action 


* Model: agent's representation of the environment 


$$s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{n-1}, a_{n-1}, r_n, s_n$$

where $s$ is a state, $a$ is an action, $r$ is a reward, and $s_n$ is the terminal state.
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Robot in a Room 


UP 


80% 
10% 
10% 


reward +1 at [4,3], -1 at [4,2] 


reward -0.04 for each step 


what's the strategy to achieve max reward? 


what if the actions were deterministic? 
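
The question of the best strategy can be explored numerically. The following is a minimal value-iteration sketch for this room, not the lecture's own code. It assumes a 4 x 3 grid with cells written as "x,y", an obstacle at [2,2] (an assumption; the obstacle position is not stated in the text), no discounting, and the 80%/10%/10% action model from the slide. The names `valueIteration`, `move`, `WALL`, and `SLIPS` are hypothetical.

```javascript
// Grid is 4 columns by 3 rows; cells are keyed "x,y" with [1,1] at the bottom-left.
const COLS = 4;
const ROWS = 3;
const WALL = "2,2"; // assumption: the obstacle position is not given in the text
const TERMINALS = { "4,3": 1, "4,2": -1 };
const MOVES = { UP: [0, 1], DOWN: [0, -1], LEFT: [-1, 0], RIGHT: [1, 0] };
// With probability 0.1 each, the robot slips to one of the perpendicular directions.
const SLIPS = {
  UP: ["LEFT", "RIGHT"], DOWN: ["LEFT", "RIGHT"],
  LEFT: ["UP", "DOWN"], RIGHT: ["UP", "DOWN"],
};

// Returns the cell reached by moving in `action`; bumping into a wall means staying put.
function move(x, y, action) {
  const [dx, dy] = MOVES[action];
  const nx = x + dx;
  const ny = y + dy;
  const blocked = nx < 1 || nx > COLS || ny < 1 || ny > ROWS || `${nx},${ny}` === WALL;
  return blocked ? `${x},${y}` : `${nx},${ny}`;
}

// Repeatedly backs up state values; returns the values and the greedy policy.
function valueIteration(stepReward, gamma = 1.0, sweeps = 200) {
  let values = {};
  for (let x = 1; x <= COLS; x++) {
    for (let y = 1; y <= ROWS; y++) {
      const key = `${x},${y}`;
      if (key !== WALL) values[key] = key in TERMINALS ? TERMINALS[key] : 0;
    }
  }
  const policy = {};
  for (let i = 0; i < sweeps; i++) {
    const next = { ...values };
    for (const key of Object.keys(values)) {
      if (key in TERMINALS) continue; // terminal values stay fixed at +1 / -1
      const [x, y] = key.split(",").map(Number);
      let best = -Infinity;
      for (const action of Object.keys(MOVES)) {
        const outcomes = [[0.8, action], [0.1, SLIPS[action][0]], [0.1, SLIPS[action][1]]];
        let expected = 0;
        for (const [p, actual] of outcomes) expected += p * values[move(x, y, actual)];
        if (expected > best) {
          best = expected;
          policy[key] = action;
        }
      }
      next[key] = stepReward + gamma * best;
    }
    values = next;
  }
  return { values, policy };
}
```

Calling `valueIteration(-0.04)` returns a policy that maps every non-terminal cell to an action. Re-running it with other step rewards shows how the step cost changes the policy; note that with a positive step reward and no discounting the values keep growing with the number of sweeps.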
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actions: UP, DOWN, LEFT, RIGHT 


When the robot chooses UP:
80% move UP
10% move LEFT
10% move RIGHT
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Is this a solution? 





* only if actions deterministic 
* not in this case (actions are stochastic) 


* solution/policy
    * mapping from each state to an action
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Optimal policy 
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Reward for each step -2 
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Reward for each step: -0.1 
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Reward for each step: -0.04 
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Reward for each step: -0.01 
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Reward for each step: +0.01 
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Value Function 


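* Future reward:

$$R = r_1 + r_2 + r_3 + \dots + r_n$$

$$R_t = r_t + r_{t+1} + r_{t+2} + \dots + r_n$$

* Discounted future reward (environment is stochastic):

$$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{n-t} r_n$$

$$= r_t + \gamma \left( r_{t+1} + \gamma \left( r_{t+2} + \dots \right) \right)$$

$$= r_t + \gamma R_{t+1}$$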


* A good strategy for an agent would be to always choose 
an action that maximizes the (discounted) future reward 
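
As a small illustration of the recursion $R_t = r_t + \gamma R_{t+1}$, the sketch below computes a discounted return by walking a reward list backwards. The function name `discountedReturn` and the sample rewards are hypothetical.

```javascript
// Computes r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite reward list,
// using the recursion R_t = r_t + gamma * R_{t+1} from the last step backwards.
function discountedReturn(rewards, gamma) {
  let ret = 0;
  for (let t = rewards.length - 1; t >= 0; t--) {
    ret = rewards[t] + gamma * ret;
  }
  return ret;
}

// Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
console.log(discountedReturn([1, 1, 1], 0.5)); // 1.75
```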
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Q-Learning 


* State-action value function: $Q^{\pi}(s,a)$
    * Expected return when starting in $s$, performing $a$, and following $\pi$

* Q-Learning: Use any policy to estimate Q that maximizes future reward:
    * Q directly approximates $Q^*$ (Bellman optimality equation)
    * Independent of the policy being followed
    * Only requirement: keep updating each $(s,a)$ pair

MIT 6.S094: Deep Learning for Self-Driving Cars, Lex Fridman, January 2018
https://selfdrivingcars.mit.edu, lex.mit.edu


Exploration vs Exploitation 


* Key ingredient of Reinforcement Learning

* Deterministic/greedy policy won't explore all actions
    * Don't know anything about the environment at the beginning
    * Need to try all actions to find the optimal one

* Maintain exploration
    * Use soft policies instead: $\pi(s,a) > 0$ (for all $s,a$)

* ε-greedy policy
    * With probability 1-ε perform the optimal/greedy action
    * With probability ε perform a random action

* Will keep exploring the environment
* Slowly move it towards greedy policy: ε -> 0
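
The ε-greedy rule above can be sketched as follows. This is an illustrative sketch only: the function names `epsilonGreedy` and `annealedEpsilon`, the injectable `rand` parameter, and the linear annealing schedule are assumptions, not something the slides specify.

```javascript
// Picks a random action with probability epsilon, otherwise the action with the
// highest Q-value. `rand` is injectable so the behavior can be made deterministic.
function epsilonGreedy(qValues, epsilon, rand = Math.random) {
  if (rand() < epsilon) {
    return Math.floor(rand() * qValues.length); // explore
  }
  let best = 0;
  for (let i = 1; i < qValues.length; i++) {
    if (qValues[i] > qValues[best]) best = i;
  }
  return best; // exploit
}

// Linearly moves epsilon from epsStart to epsMin over totalSteps (assumed schedule).
function annealedEpsilon(step, totalSteps, epsStart = 1.0, epsMin = 0.0) {
  const progress = Math.min(1, step / totalSteps);
  return epsStart + (epsMin - epsStart) * progress;
}
```

With `epsilon = 0` the function is purely greedy; with `epsilon = 1` it ignores the Q-values entirely.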
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Q-Learning: Value Iteration 





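$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \left( R_{t+1} + \gamma \max_{a} Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right)$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, $Q_t(s_t, a_t)$ is the old value (labeled "Old State" in the figure), and $R_{t+1}$ is the reward.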





initialize Q[num_states, num_actions] arbitrarily
observe initial state s
repeat
    select and carry out an action a
    observe reward r and new state s'
    Q[s,a] = Q[s,a] + α(r + γ max_a' Q[s',a'] - Q[s,a])
    s = s'
until terminated
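
A runnable version of the loop above might look like the sketch below. Everything about the environment is an assumption made for illustration: a four-state corridor where moving right from state 2 reaches the goal (reward 1), a uniformly random behavior policy (Q-learning is independent of the policy being followed), and a small seeded generator `makeRng` so that runs are repeatable.

```javascript
// Small seeded linear congruential generator so that runs are repeatable.
function makeRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Hypothetical toy environment: states 0..3 in a row, state 3 is the goal.
const N_STATES = 4;
const GOAL = 3;
const LEFT = 0;
const RIGHT = 1;

function step(s, a) {
  const next = a === RIGHT ? s + 1 : Math.max(0, s - 1);
  const done = next === GOAL;
  return { next, reward: done ? 1 : 0, done };
}

// Tabular Q-learning with a random behavior policy.
function qLearning({ episodes = 500, alpha = 0.5, gamma = 0.9, seed = 1 } = {}) {
  const rand = makeRng(seed);
  const Q = Array.from({ length: N_STATES }, () => [0, 0]);
  for (let ep = 0; ep < episodes; ep++) {
    let s = 0; // observe initial state
    for (let t = 0; t < 1000; t++) {
      const a = rand() < 0.5 ? LEFT : RIGHT; // select and carry out an action
      const { next, reward, done } = step(s, a); // observe reward and new state
      const target = done ? reward : reward + gamma * Math.max(...Q[next]);
      Q[s][a] += alpha * (target - Q[s][a]); // Q-learning update
      if (done) break;
      s = next;
    }
  }
  return Q;
}
```

Although the agent behaves randomly, the learned table prefers RIGHT in every state, which illustrates the "independent of the policy being followed" point from the Q-Learning slide.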
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Q-Learning: Representation Matters 


* In practice, Value Iteration is impractical
    * Very limited states/actions
    * Cannot generalize to unobserved states

* Think about the Breakout game
    * State: screen pixels
    * Image size: 84 x 84 (resized)
    * Consecutive 4 images
    * Grayscale with 256 gray levels

$256^{84 \times 84 \times 4}$ rows in the Q-table!
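
To get a feeling for that number, the sketch below computes the base-10 logarithm of the row count instead of the count itself, which is far too large for a floating-point number. The function name `qTableRowsLog10` is hypothetical.

```javascript
// log10 of grayLevels^(width * height * frames): the number of distinct states
// when each pixel of each stacked frame can take `grayLevels` values.
function qTableRowsLog10(width, height, frames, grayLevels) {
  return width * height * frames * Math.log10(grayLevels);
}

const exponent = qTableRowsLog10(84, 84, 4, 256);
// The row count is roughly 10 to the power of this exponent.
console.log(Math.floor(exponent)); // 67970
```

A table with about 10^67970 rows cannot be stored or visited, which is why a function approximator is needed.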


References: [83, 84]





Philosophical Motivation for Deep Reinforcement Learning 


Takeaway from Supervised Learning: 


Neural networks are great at memorization and not (yet) great at
reasoning.


Hope for Reinforcement Learning: 


Brute-force propagation of outcomes to knowledge about states 
and actions. This is a kind of brute-force “reasoning”. 


Hope for Deep Learning + Reinforcement Learning: 


General purpose artificial intelligence through efficient 
generalizable learning of the optimal thing to do given a 
formalized set of actions and states (possibly huge). 
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Deep Learning is Representation Learning 


(aka Feature Learning) 


[Figure: layered representations. Visible layer (input pixels), 1st hidden layer (edges), 2nd hidden layer (corners and contours), 3rd hidden layer (object parts), Output (object identity). Diagram labels: Machine Learning, Artificial Intelligence.]

Intelligence: Ability to accomplish complex goals.
Understanding: Ability to turn complex information into simple, useful information.
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Deep Q-Learning 


Use a function (with parameters) to approximate the Q-function

* Linear
* Non-linear: Q-Network

$$Q(s,a;\theta) \approx Q^*(s,a)$$

[Figure: a function approximator produces Q(s,a) and is trained from targets or errors; the networks shown output Q-value 1, Q-value 2, Q-value 3.]
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Deep Q-Network (DQN): Atari 
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Mnih et al. "Playing atari with deep reinforcement learning." 2013. 
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Deep Q-Network Training 


* Bellman Equation:

$$Q(s,a) = r + \gamma \max_{a'} Q(s',a')$$

* Loss function (squared error):

$$L = \mathbb{E}\Big[\big(\underbrace{r + \gamma \max_{a'} Q(s',a')}_{\text{target}} - Q(s,a)\big)^2\Big]$$
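
The loss above can be estimated on a batch of transitions as sketched below. The function name `dqnLoss`, the transition field names, and the lookup function `q` (which returns one Q-value per action for a state) are assumptions for illustration. The terminal-state case is taken from the Deep Q-Learning algorithm, where the target is just the reward.

```javascript
// Mean squared TD error over a batch of transitions.
// `q(state)` is assumed to return an array with one Q-value per action.
function dqnLoss(batch, q, gamma) {
  let total = 0;
  for (const { s, a, r, sNext, terminal } of batch) {
    // target = r for terminal transitions, else r + gamma * max_a' Q(s', a')
    const target = terminal ? r : r + gamma * Math.max(...q(sNext));
    const error = target - q(s)[a];
    total += error * error;
  }
  return total / batch.length;
}

// Usage with a tiny lookup table standing in for the network.
const exampleQ = { A: [0, 1], B: [2, 0] };
const exampleBatch = [
  { s: "A", a: 1, r: 1, sNext: "B", terminal: false }, // target 2, error 1
  { s: "B", a: 0, r: 0, sNext: "A", terminal: true },  // target 0, error -2
];
console.log(dqnLoss(exampleBatch, (s) => exampleQ[s], 0.5)); // 2.5
```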
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DQN Training 


Given a transition <s, a, r, s'>, the Q-table update rule in the
previous algorithm must be replaced with the following:

1. Do a feedforward pass for the current state s to get predicted Q-values for all actions.
2. Do a feedforward pass for the next state s' and calculate the maximum over all network outputs, max_a' Q(s', a').
3. Set the Q-value target for action a to r + γ max_a' Q(s', a') (use the max calculated in step 2). For all other actions, set the Q-value target to the same as originally returned from step 1, making the error 0 for those outputs.
4. Update the weights using backpropagation.
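
The target construction in these steps can be sketched as a small function. The forward passes and backpropagation are left out; `qCurrent` and `qNext` stand for the network outputs for s and s'. The function name `buildTargets` is hypothetical.

```javascript
// Builds the training target vector for one transition:
// the taken action gets r + gamma * max_a' Q(s', a'); every other output keeps
// its predicted value, so its error is 0.
function buildTargets(qCurrent, action, reward, gamma, qNext) {
  const targets = qCurrent.slice(); // copy the predicted Q-values for s
  targets[action] = reward + gamma * Math.max(...qNext);
  return targets;
}

const predicted = [0.5, 0.25, 0];
const targets = buildTargets(predicted, 1, 1, 0.5, [0, 2, 1]);
console.log(targets); // [ 0.5, 2, 0 ]
console.log(targets.map((t, i) => t - predicted[i])); // [ 0, 1.75, 0 ]
```

Only the output for the taken action produces a non-zero error, so backpropagation changes the weights on account of that output alone.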
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DQN Tricks 


* Experience Replay
    * Stores experiences (actions, state transitions, and rewards) and creates mini-batches from them for the training process
* Fixed Target Network
    * The error calculation includes the target function, which depends on network parameters and thus changes quickly. Updating it only every 1,000 steps increases stability of the training process.

$$Q(s_t,a) \leftarrow Q(s_t,a) + \alpha \left[ r_{t+1} + \gamma \max_{p} Q(s_{t+1},p) - Q(s_t,a) \right]$$

* Reward Clipping
    * To standardize rewards across games by setting all positive rewards to +1 and all negative to -1.
* Skipping Frames
    * Skip every 4 frames to take action
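
Two of these tricks are easy to sketch in isolation. The code below is an illustrative sketch, not the DQN implementation: `ReplayMemory` is a hypothetical fixed-capacity buffer that overwrites its oldest entry when full and samples uniformly with replacement, and `clipReward` maps rewards to +1, 0, or -1.

```javascript
// Fixed-capacity experience store; the oldest experience is overwritten when full.
class ReplayMemory {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = [];
    this.next = 0; // index that the next push writes to once the buffer is full
  }

  push(experience) {
    if (this.items.length < this.capacity) {
      this.items.push(experience);
    } else {
      this.items[this.next] = experience;
    }
    this.next = (this.next + 1) % this.capacity;
  }

  // Uniform sampling with replacement; `rand` is injectable for repeatable runs.
  sample(batchSize, rand = Math.random) {
    const batch = [];
    for (let i = 0; i < batchSize; i++) {
      batch.push(this.items[Math.floor(rand() * this.items.length)]);
    }
    return batch;
  }

  get size() {
    return this.items.length;
  }
}

// Reward clipping: every positive reward becomes +1, every negative one -1.
const clipReward = (r) => Math.sign(r);
```

Sampling random mini-batches breaks the correlation between consecutive frames, which is one reason replay is commonly credited with stabilizing training.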
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DQN Tricks 


Scores with and without the tricks (as recovered from the slide; one cell is missing in the source):

Game            | Replay + fixed target | Replay only | Fixed target only | Neither
Breakout        | 316.8                 | 240.7       | 10.2              | 3.2
River Raid      | 7446.6                | 4102.8      | 2867.7            | 1453.0
Seaquest        | 2894.4                | 822.6       | 1003.0            | 275.8
Space Invaders  | 1088.9                | (missing)   | 373.2             | 302.0
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Deep Q-Learning Algorithm 


initialize replay memory D
initialize action-value function Q with random weights
observe initial state s
repeat
    select an action a
        with probability ε select a random action
        otherwise select a = argmax_a' Q(s, a')
    carry out action a
    observe reward r and new state s'
    store experience <s, a, r, s'> in replay memory D

    sample random transitions <ss, aa, rr, ss'> from replay memory D
    calculate target for each minibatch transition
        if ss' is terminal state then tt = rr
        otherwise tt = rr + γ max_a' Q(ss', aa')
    train the Q network using (tt - Q(ss, aa))^2 as loss

    s = s'
until terminated
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Atari Breakout 





[Figure: Breakout gameplay after 10 minutes, 120 minutes, and 240 minutes of training.]
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DQN Results in Atari 
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Game of Go 
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Human expert 





positions 
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AlphaGo (2016) Beat Top Human at Go 


[Figure: AlphaGo training pipeline. Supervised learning policy network, reinforcement learning policy network, self-play data, value network.]

[Figure: rating scale comparing computer programs and human players.
AlphaGo (Mar 2016), DeepMind challenge match: Lee Sedol (9p), top player of the past decade.
AlphaGo (Oct 2015), Nature match: Fan Hui (2p), 3-times reigning Euro Champion.
Also shown: Crazy Stone and Zen; amateur humans.]
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Elo Rating 


AlphaGo Zero (2017): Beats AlphaGo

[Figure: Elo rating curves (horizontal axis from 0 to 40) for AlphaGo Zero 40 blocks, AlphaGo Lee, and AlphaGo Master.]
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AlphaGo Zero Approach 


* Same as the best before: Monte Carlo Tree Search (MCTS)
    * Balance exploitation/exploration (going deep on promising positions or exploring new underplayed positions)
* Use a neural network as "intuition" for which positions to expand as part of MCTS (same as AlphaGo)

[Figure: MCTS steps: (a) Selection, (b) Expansion, (c) Evaluation, (d) Backup.]
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AlphaGo Zero Approach 


* "Tricks"
    * Use MCTS intelligent look-ahead (instead of human games) to improve value estimates of play options
    * Multi-task learning: "two-headed" network that outputs (1) move probability and (2) probability of winning.
    * Updated architecture: use residual networks
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Americans spend 8 billion hours stuck in traffic every year. 
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Autonomous Driving: A Hierarchical View 


[Figure: hierarchical view of autonomous driving.
Inputs: user-specified destination; road network data; perceived agents, obstacles, and signage (environment); estimated pose and collision-free space; estimate of vehicle state.
Stages and outputs: Route Planning; Motion Specification; Motion Planning, which produces a reference path or trajectory; Local Feedback Control, which produces steering, throttle and brake commands.]

Paden B, Cáp M, Yong SZ, Yershov D, Frazzoli E. "A Survey of Motion Planning and Control Techniques for Self-driving Urban Vehicles." IEEE Transactions on Intelligent Vehicles 1.1 (2016): 33-55.
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Applying Deep Reinforcement Learning 
to Micro-Traffic Simulation 


Time=2702 4 scale=2 1 


Disturb Traffic 
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DeepTraffic: Deep Reinforcement Learning Competition 








[Screenshot: DeepTraffic web interface (Massachusetts Institute of Technology). Readouts: speed (72 mph), cars passed (195), simulation speed. Buttons: Apply Code/Reset Net, Save Code/Net to File, Load Code/Net from File, Submit Model to Competition, Load Custom Image, Request Visualization. Options: road overlay, vehicle skins.]

DeepTraffic
Main Page - Leaderboard - About DeepTraffic
Americans spend 8 billion hours stuck in traffic every year.
Deep neural networks can help!

lanesSide = 3;
patchesAhead = 30;
patchesBehind = 10;
trainIterations = 10000;

// the number of other autonomous vehicles controlled by your network
otherAgents = 0; // max of 9

var num_inputs = (lanesSide * 2 + 1) * (patchesAhead + patchesBehind);

Value Function Approximating Neural Network:
input(280) fc(50)

https://selfdrivingcars.mit.edu/deeptraffic 


Goal: Achieve the highest average speed over a long period of time.
Requirement for Students: Follow the tutorial to achieve a speed of 65 mph.
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What You Should Do 


* To compete:
    * Read the tutorial: https://selfdrivingcars.mit.edu/deeptraffic-about
    * Change parameters in the code box.
    * Click the white "Apply Code/Reset Net" button.
    * Click the blue "Run Training" button.
    * Click "Submit Model to Competition".

* And to visualize your submission for sharing with others:
    * Customize your vehicle image ("Load Custom Image").
    * Customize your color scheme.
    * Click "Request Visualization".
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The Road, The Car, The Speed 


Speed: 80 mph
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The Road, The Car, The Speed 


State Representation:

Speed: 47 mph

Cars Passed:
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Road Overlay: 
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Simulation Speed 


Simulation Speed: Normal
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Display Options 


















































































































[Figure: the four road overlay modes. Road Overlay: None; Road Overlay: Learning Input; Road Overlay: Safety System; Road Overlay: Full Map.]

For the full updated list of references visit: https://selfdrivingcars.mit.edu/references

"Safety System": Motion and Control are Given
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Safety System 
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Learning the "Behavioral Layer" Task 
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Learning the "Behavioral Layer" Task 


[Screenshot: DeepTraffic in the browser (https://selfdrivingcars.mit.edu/deeptraffic).]

Value Function Approximating Neural Network:
input(140)
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Action Space 


var noAction = 0;
var accelerateAction = 1;
var decelerateAction = 2;
var goLeftAction = 3;
var goRightAction = 4;
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Driving / Learning 
learn = function (state, lastReward) {
    // reward the network for its previous action, then pick the next one
    brain.backward(lastReward);
    var action = brain.forward(state);
    return action;
}
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Learning Input 
lanesSide = 1;        lanesSide = 2;        lanesSide = 1;
patchesAhead = 10;    patchesAhead = 10;    patchesAhead = 10;
patchesBehind = 0;    patchesBehind = 0;    patchesBehind = 10;
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Multiple Agents 


// the number of other autonomous vehicles controlled by your network 
otherAgents = 0; // max of 9


Deep RL: Q-Function Learning Parameters 


Value Function Approximating Neural Network: 


input(135) fc(10) relu(10) fc(5) regression(5)

var num_inputs = (lanesSide * 2 + 1) * (patchesAhead + patchesBehind);
var num_actions = 5;
var temporal_window = 3;
var network_size = num_inputs * temporal_window + num_actions * temporal_window + num_inputs;
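
As a quick check of the formulas above, the sketch below wraps them in two functions and reproduces the input sizes shown on the slides. The function names `countInputs` and `networkSize` are hypothetical. The parameter combinations are inferred from the sizes on the slides: input(135) matches lanesSide = 1, patchesAhead = 10, patchesBehind = 0 with a temporal window of 3, and input(280) matches lanesSide = 3, patchesAhead = 30, patchesBehind = 10 with a temporal window of 0.

```javascript
// Number of grid patches the agent sees: (lanes on both sides + own lane) times
// (patches ahead + patches behind).
function countInputs(lanesSide, patchesAhead, patchesBehind) {
  return (lanesSide * 2 + 1) * (patchesAhead + patchesBehind);
}

// Size of the network input layer: past states and past actions inside the
// temporal window, plus the current state.
function networkSize(numInputs, numActions, temporalWindow) {
  return numInputs * temporalWindow + numActions * temporalWindow + numInputs;
}

console.log(networkSize(countInputs(1, 10, 0), 5, 3));  // 135
console.log(networkSize(countInputs(3, 30, 10), 5, 0)); // 280
```

Widening the visible area or the temporal window grows the input layer quickly, so these parameters trade off information against training cost.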
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Deep RL: Layers 


layer_defs.push({
    type: 'fc',
    num_neurons: 10,
    activation: 'relu'
});

fc(10) relu(10)
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Deep RL: Output (Actions) 





layer_defs.push({
    type: 'regression',
    num_neurons: num_actions
});

fc(5) regression(5)
ConvNetJS: Options

var opt = {};
opt.temporal_window = temporal_window;
opt.experience_size = 3000;
opt.start_learn_threshold = 500;
opt.gamma = 0.7;
opt.learning_steps_total = 10000;
opt.learning_steps_burnin = 1000;
opt.epsilon_min = 0.0;
opt.epsilon_test_time = 0.0;
opt.layer_defs = layer_defs;
opt.tdtrainer_options = {learning_rate: 0.001, momentum: 0.0, batch_size: 64, l2_decay: 0.01};

brain = new deepqlearn.Brain(num_inputs, num_actions, opt);
Coding/Changing the Net Layout

//<![CDATA[

// a few things don't have var in front of them - they update already
// existing variables the game needs
lanesSide = 1;
patchesAhead = 10;
patchesBehind = 10;
trainIterations = 100000;

// begin from convnetjs example
var num_inputs = (lanesSide * 2 + 1) * (patchesAhead + patchesBehind);
var num_actions = 5;
var temporal_window = 3; //1 // amount of temporal memory. 0 = agent lives in-the-moment :)
var network_size = num_inputs * temporal_window + num_actions * temporal_window + num_inputs;

Apply Code/Reset Net

Watch out: kills trained state!
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trainIterations = 100000; 


Training 





* Done on separate thread (Web Workers)
    * Separate simulation, resets, state, etc.
    * A lot faster (1000 fps +)

* Network state gets shipped to the main simulation from time to time
    * You get to see the improvements/learning live
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Training 


trainIterations = 100000;


Run Training 
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Local Evaluation 
(Doesn't Count) 


Evaluation 





* Scoring: Average Speed 
* Method: 


* Collect average speed 
* Ten runs, about 45 (simulated) minutes of game each 
* Result: median speed of the 500 runs 


* Done server side after you submit 


* You can try it locally to get an estimate 
* Uses exactly the same evaluation procedure/code 


* DeepTraffic 2.0: Significantly reduced the influence of randomness
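
The scoring rule takes the median of the per-run average speeds. A minimal sketch of that aggregation is shown below; the function name `median` and the sample speeds are hypothetical, and the actual server-side evaluation code is not shown in the slides.

```javascript
// Median of a list of per-run average speeds (mph).
// For an even number of runs, the two middle values are averaged.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

console.log(median([51, 60, 48])); // 51
console.log(median([50, 52]));     // 51
```

Using the median rather than the mean makes the score less sensitive to a single unusually lucky or unlucky run.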
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Average speed: 51 mph 


^" Start Evaluation Run 
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Loading/Saving 


Save Code/Net to File 


* Danger: Overwrites all of your code and the trained net 


Load Code/Net from File 
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Submitting Your Network 


Submit Model to Competition 


* Submits your code and the trained net state
    * Make sure you ran training!

* Adds your code to the end of a queue
    * Gets evaluated some time soon (no promises when)

* You can resubmit as often as you like
    * If your code wasn't evaluated yet, we still remove it from the queue (and move you to the end)

* The highest score counts.
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Customization and Visualization 





Load Custom Image 
Red Y 


Request Visualization 


Vehicle Skins 
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DeepTraffic: Deep Reinforcement Learning Competition 


* Competition: https://selfdrivingcars.mit.edu/deeptraffic
* GitHub: https://github.com/lexfridman/deeptraffic
* Paper on arXiv: https://arxiv.org/abs/1801.02805
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